require(data.table)
require(lubridate)
require(forecast)
require(skimr)
require(repr)
require(readxl)
require(ggplot2)
require(tidyverse)
require(GGally)
options(repr.plot.width=12.7, repr.plot.height=8.5)
This project focuses on providing hourly solar power predictions for Edikli GES (Güneş Enerjisi Santrali) located in Niğde, at coordinates 38.29° North and 34.97° East. The objective is to forecast solar energy production for day d+1 based on the production data available up to the end of day d-1.
The forecasts cover the period from May 13th to June 26th. Since the data used for each day’s production includes the production information up to two days prior, the data is updated daily within this date range.
In the forecasting model, both the production data of previous days and weather information for the respective location are utilized. Weather information variables are stored as follows: DSWRF_surface (downward shortwave radiation flux), USWRF_top_of_atmosphere, USWRF_surface, DLWRF_surface (solar radiation-related variables), TCDC_low.cloud.layer, TCDC_middle.cloud.layer, TCDC_high.cloud.layer, TCDC_entire.atmosphere (total cloud cover data for different cloud types), CSNOW_surface (categorical snow variable, indicating presence or absence of snow), and TMP_surface (temperature at the location).
During the project development process, we explored various methods including clustering (grouping similar structured production hours) and forecasting daily total production predictions followed by disaggregating them into 24 hours. However, upon evaluating the test metrics, we found that the best results were achieved by creating separate hourly models. Therefore, the main focus of our project is to build a model for each hour, identifying and incorporating the important predictors for that specific hour. Since there is no significant production between hours 19 and 4, we assumed the production to be zero for those 10 hours and prepared distinct models for the remaining 14 hours.
Production data is imported from the desktop, and any duplicate production data is removed. Production data after May 15th is excluded. Weather data is imported and converted from long to wide format, and the adjusted data is ordered as a precaution. Following this, the averages of each variable are calculated for different coordinates. Since each variable has 25 coordinates, which are highly correlated, averaging them seemed logical. Subsequently, the weather data and production data are combined into a single data table. The month, day, and hour information is extracted as a factor, as they may also be useful for model creation. As per instructions to evaluate until May 15th, both the weather and production data are disregarded after that date.
data_path = "/Users/eylulruyagullu/Desktop/production_may30.csv"
production_data = fread(data_path)
unique_production <- production_data[!duplicated(production_data[, c("date", "hour")]), ]
unique_production <- unique_production[-c((nrow(unique_production)-15*24+1):nrow(unique_production)), ]
data_path <- "/Users/eylulruyagullu/Desktop/processed_weather_may30.csv"
weather_data <- fread(data_path)
weather_data_modified <- weather_data %>%
pivot_wider(names_from = c(lat, lon), values_from = c(dswrf_surface, tcdc_low.cloud.layer, tcdc_middle.cloud.layer, tcdc_high.cloud.layer, tcdc_entire.atmosphere, uswrf_top_of_atmosphere, csnow_surface, dlwrf_surface, uswrf_surface, tmp_surface), names_sep = "_")
weather_modified_ordered <- weather_data_modified[order(weather_data_modified$date, weather_data_modified$hour), ]
weather_modified_ordered$dswrf_surface_avg <- rowMeans(weather_modified_ordered[, 3:27])
weather_modified_ordered$tcdc_low.cloud.layer_avg <- rowMeans(weather_modified_ordered[, 28:52])
weather_modified_ordered$tcdc_middle.cloud.layer_avg <- rowMeans(weather_modified_ordered[, 53:77])
weather_modified_ordered$tcdc_high.cloud.layer_avg <- rowMeans(weather_modified_ordered[, 78:102])
weather_modified_ordered$tcdc_entire.atmosphere_avg <- rowMeans(weather_modified_ordered[, 103:127])
weather_modified_ordered$uswrf_top_of_atmosphere_avg <- rowMeans(weather_modified_ordered[, 128:152])
weather_modified_ordered$csnow_surface_avg <- rowMeans(weather_modified_ordered[, 153:177])
weather_modified_ordered$dlwrf_surface_avg <- rowMeans(weather_modified_ordered[, 178:202])
weather_modified_ordered$uswrf_surface_avg <- rowMeans(weather_modified_ordered[, 203:227])
weather_modified_ordered$tmp_surface_avg <- rowMeans(weather_modified_ordered[, 228:252])
weather_data_final <- cbind(weather_modified_ordered[, 1:2], weather_modified_ordered[, (ncol(weather_modified_ordered)-9):ncol(weather_modified_ordered)])
weather_data_final <- weather_data_final[-c(nrow(unique_production):nrow(weather_data_final))]
selected_columns_production <- unique_production[, c("date", "hour", "production")]
selected_columns_weather <- weather_data_final[1:nrow(unique_production), c("dswrf_surface_avg", "tcdc_low.cloud.layer_avg", "tcdc_middle.cloud.layer_avg", "tcdc_high.cloud.layer_avg", "tcdc_entire.atmosphere_avg", "uswrf_top_of_atmosphere_avg", "csnow_surface_avg", "dlwrf_surface_avg", "uswrf_surface_avg", "tmp_surface_avg")]
selected_rows <- weather_data_final[(nrow(unique_production) + 1):(nrow(weather_data_final)),]
merged_data <- cbind(selected_columns_production, selected_columns_weather)
merged_data[, date:=as.Date(date)]
# Create a new column which that makes month, day and hour as a factor
merged_data[, hour := as.factor(hour)]
merged_data[, month := factor(month(date), levels = 1:12)]
merged_data[, day := day(date)]
Also, a function is defined in order to evaluate the models.
accu=function(actual,forecast){
n=length(actual)
error=actual-forecast
mean=mean(actual)
sd=sd(actual)
CV=sd/mean
FBias=sum(error)/sum(actual)
MAPE=sum(abs(error/actual))/n
RMSE=sqrt(sum(error^2)/n)
MAD=sum(abs(error))/n
MADP=sum(abs(error))/sum(abs(actual))
WMAPE=MAD/mean
l=data.frame(n,mean,sd,CV,FBias,MAPE,RMSE,MAD,MADP,WMAPE)
return(l)}
When plotting production against the data, we observe a cutoff at a production level of 10. This pattern holds true except for the second half of 2022, which might be attributed to an additional production permit or inaccurate production information. Consequently, we can conclude that our production data predominantly falls between 0 and 10.
ggplot(merged_data, aes(x=date)) + geom_line(aes(y=production, color="Production")) +
labs(title = "Production vs Time", x = "Date",y = "Production")
The autocorrelation function of the production data reveals significant daily seasonality with a lag of 24 hours. Specifically, lag-24 exhibits a strong positive autocorrelation, whereas lag-12 shows a strong negative autocorrelation.
ggAcf(merged_data$production, lag.max = 96) + ggtitle("Production ACF")
The relationship between each predictor variable and the forecast variable is shown below.
ggpairs(merged_data[, c("production",
"dswrf_surface_avg",
"tcdc_low.cloud.layer_avg",
"tcdc_middle.cloud.layer_avg",
"tcdc_high.cloud.layer_avg",
"tcdc_entire.atmosphere_avg",
"uswrf_top_of_atmosphere_avg",
"csnow_surface_avg",
"dlwrf_surface_avg",
"uswrf_surface_avg",
"tmp_surface_avg")])
plot: [1, 1] [>-----------------------------------------------------------------------------------] 1% est: 0s
plot: [1, 2] [>-----------------------------------------------------------------------------------] 2% est: 6s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removing 1 row that contained a missing value
plot: [1, 3] [=>----------------------------------------------------------------------------------] 2% est: 8s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 2 rows containing missing values
plot: [1, 4] [==>---------------------------------------------------------------------------------] 3% est: 9s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 2 rows containing missing values
plot: [1, 5] [==>---------------------------------------------------------------------------------] 4% est: 9s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 4 rows containing missing values
plot: [1, 6] [===>--------------------------------------------------------------------------------] 5% est: 9s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 3 rows containing missing values
plot: [1, 7] [====>-------------------------------------------------------------------------------] 6% est:10s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 3 rows containing missing values
plot: [1, 8] [=====>------------------------------------------------------------------------------] 7% est:10s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removing 1 row that contained a missing value
plot: [1, 9] [=====>------------------------------------------------------------------------------] 7% est:10s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removing 1 row that contained a missing value
plot: [1, 10] [======>----------------------------------------------------------------------------] 8% est:10s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 2 rows containing missing values
plot: [1, 11] [=======>---------------------------------------------------------------------------] 9% est:10s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removing 1 row that contained a missing value
plot: [2, 1] [=======>----------------------------------------------------------------------------] 10% est:10s Warning: Removed 1 row containing missing values or values outside the scale range (`geom_point()`).
plot: [2, 2] [========>---------------------------------------------------------------------------] 11% est:10s Warning: Removed 1 row containing non-finite outside the scale range (`stat_density()`).
plot: [2, 3] [=========>--------------------------------------------------------------------------] 12% est:10s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 2 rows containing missing values
plot: [2, 4] [=========>--------------------------------------------------------------------------] 12% est:10s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 2 rows containing missing values
plot: [2, 5] [==========>-------------------------------------------------------------------------] 13% est:10s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 4 rows containing missing values
plot: [2, 6] [===========>------------------------------------------------------------------------] 14% est:10s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 3 rows containing missing values
plot: [2, 7] [===========>------------------------------------------------------------------------] 15% est:10s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 3 rows containing missing values
plot: [2, 8] [============>-----------------------------------------------------------------------] 16% est:10s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removing 1 row that contained a missing value
plot: [2, 9] [=============>----------------------------------------------------------------------] 17% est:10s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removing 1 row that contained a missing value
plot: [2, 10] [=============>---------------------------------------------------------------------] 17% est:10s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 2 rows containing missing values
plot: [2, 11] [==============>--------------------------------------------------------------------] 18% est: 9s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removing 1 row that contained a missing value
plot: [3, 1] [===============>--------------------------------------------------------------------] 19% est: 9s Warning: Removed 2 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [3, 2] [================>-------------------------------------------------------------------] 20% est:10s Warning: Removed 2 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [3, 3] [================>-------------------------------------------------------------------] 21% est:10s Warning: Removed 2 rows containing non-finite outside the scale range (`stat_density()`).
plot: [3, 4] [=================>------------------------------------------------------------------] 21% est:10s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 2 rows containing missing values
plot: [3, 5] [==================>-----------------------------------------------------------------] 22% est:10s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 5 rows containing missing values
plot: [3, 6] [==================>-----------------------------------------------------------------] 23% est: 9s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 4 rows containing missing values
plot: [3, 7] [===================>----------------------------------------------------------------] 24% est: 9s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 4 rows containing missing values
plot: [3, 8] [====================>---------------------------------------------------------------] 25% est:10s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 2 rows containing missing values
plot: [3, 9] [=====================>--------------------------------------------------------------] 26% est:10s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 2 rows containing missing values
plot: [3, 10] [=====================>-------------------------------------------------------------] 26% est:10s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 3 rows containing missing values
plot: [3, 11] [======================>------------------------------------------------------------] 27% est: 9s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 2 rows containing missing values
plot: [4, 1] [=======================>------------------------------------------------------------] 28% est:10s Warning: Removed 2 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [4, 2] [=======================>------------------------------------------------------------] 29% est:10s Warning: Removed 2 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [4, 3] [========================>-----------------------------------------------------------] 30% est: 9s Warning: Removed 2 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [4, 4] [=========================>----------------------------------------------------------] 31% est: 9s Warning: Removed 2 rows containing non-finite outside the scale range (`stat_density()`).
plot: [4, 5] [=========================>----------------------------------------------------------] 31% est: 9s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 5 rows containing missing values
plot: [4, 6] [==========================>---------------------------------------------------------] 32% est: 9s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 4 rows containing missing values
plot: [4, 7] [===========================>--------------------------------------------------------] 33% est: 9s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 4 rows containing missing values
plot: [4, 8] [===========================>--------------------------------------------------------] 34% est: 9s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 2 rows containing missing values
plot: [4, 9] [============================>-------------------------------------------------------] 35% est: 9s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 2 rows containing missing values
plot: [4, 10] [============================>------------------------------------------------------] 36% est: 9s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 3 rows containing missing values
plot: [4, 11] [=============================>-----------------------------------------------------] 36% est: 8s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 2 rows containing missing values
plot: [5, 1] [==============================>-----------------------------------------------------] 37% est: 8s Warning: Removed 4 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [5, 2] [===============================>----------------------------------------------------] 38% est: 8s Warning: Removed 4 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [5, 3] [================================>---------------------------------------------------] 39% est: 8s Warning: Removed 5 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [5, 4] [================================>---------------------------------------------------] 40% est: 8s Warning: Removed 5 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [5, 5] [=================================>--------------------------------------------------] 40% est: 8s Warning: Removed 4 rows containing non-finite outside the scale range (`stat_density()`).
plot: [5, 6] [==================================>-------------------------------------------------] 41% est: 8s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 6 rows containing missing values
plot: [5, 7] [==================================>-------------------------------------------------] 42% est: 8s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 6 rows containing missing values
plot: [5, 8] [===================================>------------------------------------------------] 43% est: 8s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 4 rows containing missing values
plot: [5, 9] [====================================>-----------------------------------------------] 44% est: 7s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 4 rows containing missing values
plot: [5, 10] [====================================>----------------------------------------------] 45% est: 7s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 5 rows containing missing values
plot: [5, 11] [=====================================>---------------------------------------------] 45% est: 7s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 4 rows containing missing values
plot: [6, 1] [======================================>---------------------------------------------] 46% est: 7s Warning: Removed 3 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [6, 2] [=======================================>--------------------------------------------] 47% est: 7s Warning: Removed 3 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [6, 3] [=======================================>--------------------------------------------] 48% est: 7s Warning: Removed 4 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [6, 4] [========================================>-------------------------------------------] 49% est: 7s Warning: Removed 4 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [6, 5] [=========================================>------------------------------------------] 50% est: 7s Warning: Removed 6 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [6, 6] [=========================================>------------------------------------------] 50% est: 7s Warning: Removed 3 rows containing non-finite outside the scale range (`stat_density()`).
plot: [6, 7] [==========================================>-----------------------------------------] 51% est: 7s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 5 rows containing missing values
plot: [6, 8] [===========================================>----------------------------------------] 52% est: 6s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 3 rows containing missing values
plot: [6, 9] [===========================================>----------------------------------------] 53% est: 6s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 3 rows containing missing values
plot: [6, 10] [============================================>--------------------------------------] 54% est: 6s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 4 rows containing missing values
plot: [6, 11] [============================================>--------------------------------------] 55% est: 6s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 3 rows containing missing values
plot: [7, 1] [==============================================>-------------------------------------] 55% est: 6s Warning: Removed 3 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [7, 2] [==============================================>-------------------------------------] 56% est: 6s Warning: Removed 3 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [7, 3] [===============================================>------------------------------------] 57% est: 6s Warning: Removed 4 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [7, 4] [================================================>-----------------------------------] 58% est: 6s Warning: Removed 4 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [7, 5] [================================================>-----------------------------------] 59% est: 5s Warning: Removed 6 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [7, 6] [=================================================>----------------------------------] 60% est: 5s Warning: Removed 5 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [7, 7] [==================================================>---------------------------------] 60% est: 5s Warning: Removed 3 rows containing non-finite outside the scale range (`stat_density()`).
plot: [7, 8] [==================================================>---------------------------------] 61% est: 5s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 3 rows containing missing values
plot: [7, 9] [===================================================>--------------------------------] 62% est: 5s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 3 rows containing missing values
plot: [7, 10] [===================================================>-------------------------------] 63% est: 5s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 3 rows containing missing values
plot: [7, 11] [====================================================>------------------------------] 64% est: 5s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 3 rows containing missing values
plot: [8, 1] [=====================================================>------------------------------] 64% est: 5s Warning: Removed 1 row containing missing values or values outside the scale range (`geom_point()`).
plot: [8, 2] [======================================================>-----------------------------] 65% est: 5s Warning: Removed 1 row containing missing values or values outside the scale range (`geom_point()`).
plot: [8, 3] [=======================================================>----------------------------] 66% est: 5s Warning: Removed 2 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [8, 4] [=======================================================>----------------------------] 67% est: 4s Warning: Removed 2 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [8, 5] [========================================================>---------------------------] 68% est: 4s Warning: Removed 4 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [8, 6] [=========================================================>--------------------------] 69% est: 4s Warning: Removed 3 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [8, 7] [=========================================================>--------------------------] 69% est: 4s Warning: Removed 3 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [8, 8] [==========================================================>-------------------------] 70% est: 4s Warning: Removed 1 row containing non-finite outside the scale range (`stat_density()`).
plot: [8, 9] [===========================================================>------------------------] 71% est: 4s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removing 1 row that contained a missing value
plot: [8, 10] [===========================================================>-----------------------] 72% est: 4s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 2 rows containing missing values
plot: [8, 11] [===========================================================>-----------------------] 73% est: 4s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removing 1 row that contained a missing value
plot: [9, 1] [=============================================================>----------------------] 74% est: 4s Warning: Removed 1 row containing missing values or values outside the scale range (`geom_point()`).
plot: [9, 2] [=============================================================>----------------------] 74% est: 3s Warning: Removed 1 row containing missing values or values outside the scale range (`geom_point()`).
plot: [9, 3] [==============================================================>---------------------] 75% est: 3s Warning: Removed 2 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [9, 4] [===============================================================>--------------------] 76% est: 3s Warning: Removed 2 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [9, 5] [================================================================>-------------------] 77% est: 3s Warning: Removed 4 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [9, 6] [================================================================>-------------------] 78% est: 3s Warning: Removed 3 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [9, 7] [=================================================================>------------------] 79% est: 3s Warning: Removed 3 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [9, 8] [==================================================================>-----------------] 79% est: 3s Warning: Removed 1 row containing missing values or values outside the scale range (`geom_point()`).
plot: [9, 9] [==================================================================>-----------------] 80% est: 3s Warning: Removed 1 row containing non-finite outside the scale range (`stat_density()`).
plot: [9, 10] [==================================================================>----------------] 81% est: 3s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 2 rows containing missing values
plot: [9, 11] [===================================================================>---------------] 82% est: 2s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removing 1 row that contained a missing value
plot: [10, 1] [====================================================================>--------------] 83% est: 2s Warning: Removed 2 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [10, 2] [====================================================================>--------------] 83% est: 2s Warning: Removed 2 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [10, 3] [=====================================================================>-------------] 84% est: 2s Warning: Removed 3 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [10, 4] [======================================================================>------------] 85% est: 2s Warning: Removed 3 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [10, 5] [======================================================================>------------] 86% est: 2s Warning: Removed 5 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [10, 6] [=======================================================================>-----------] 87% est: 2s Warning: Removed 4 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [10, 7] [========================================================================>----------] 88% est: 2s Warning: Removed 3 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [10, 8] [========================================================================>----------] 88% est: 2s Warning: Removed 2 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [10, 9] [=========================================================================>---------] 89% est: 1s Warning: Removed 2 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [10, 10] [=========================================================================>--------] 90% est: 1s Warning: Removed 2 rows containing non-finite outside the scale range (`stat_density()`).
plot: [10, 11] [==========================================================================>-------] 91% est: 1s Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 2 rows containing missing values
plot: [11, 1] [===========================================================================>-------] 92% est: 1s Warning: Removed 1 row containing missing values or values outside the scale range (`geom_point()`).
plot: [11, 2] [============================================================================>------] 93% est: 1s Warning: Removed 1 row containing missing values or values outside the scale range (`geom_point()`).
plot: [11, 3] [=============================================================================>-----] 93% est: 1s Warning: Removed 2 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [11, 4] [=============================================================================>-----] 94% est: 1s Warning: Removed 2 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [11, 5] [==============================================================================>----] 95% est: 1s Warning: Removed 4 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [11, 6] [===============================================================================>---] 96% est: 1s Warning: Removed 3 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [11, 7] [===============================================================================>---] 97% est: 0s Warning: Removed 3 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [11, 8] [================================================================================>--] 98% est: 0s Warning: Removed 1 row containing missing values or values outside the scale range (`geom_point()`).
plot: [11, 9] [=================================================================================>-] 98% est: 0s Warning: Removed 1 row containing missing values or values outside the scale range (`geom_point()`).
plot: [11, 10] [================================================================================>-] 99% est: 0s Warning: Removed 2 rows containing missing values or values outside the scale range (`geom_point()`).
plot: [11, 11] [==================================================================================]100% est: 0s
Warning: Removed 1 row containing non-finite outside the scale range (`stat_density()`).
We initially attempted a daily forecasting model, followed by disaggregation to hourly predictions.
The process involved: Training a model to predict daily total production. Creating separate models for each hour. Predicting hourly production using the hourly models. Calculating the percentage contribution of each hour to the hourly production total. Disaggregating the daily prediction based on these percentage contributions.
While the daily model performed well, the disaggregation process proved unreliable, leading to the abandonment of this approach.
Even though we conduct a better model with daily totals rather than hourly model, dis aggregation procedure didn’t provide reliable results and we didn’t select this approach.
This model grouped hours with similar historical production patterns into clusters. The clusters defined were: (5-6 AM), (7-8 AM), (9 AM - 2 PM), (3-4 PM), and (5-6 PM). Remaining hours were assumed to have zero production. Though some clusters achieved good results, the overall performance of the hourly model proved superior.
By exploring these alternative approaches, we confirmed that the current hourly model provides the most accurate and reliable forecast for our specific needs. While the daily model with disaggregation offered a promising initial approach, the disaggregation process introduced significant errors. The clustered model, while offering some promise within individual clusters, failed to capture the overall hourly variations effectively.
We will evaluate both time series linear regression and ARIMA models in our project. We started with analyzing time series linear regression models.
Model Development: Baseline Model -We began by constructing a baseline model incorporating all available predictor variables. -Analysis of weather data and production trends revealed seasonal patterns. To address this, a month dummy variable was introduced. -Past production data also exhibited a correlation with current production. To capture this, lag values were included as predictors. Since on day d+1 we have production data until the end of day d-1, we didn’t include lag 1 in our models.
Variable Selection and Refinement: We employed various metrics to refine the initial model -Significance Levels: While prioritizing highly significant variables, we didn’t solely rely on this metric. -Variable Relationships: We considered potential interactions between variables and their impact on the model. For instance, considering the high correlation within the TCDC variables, we strategically included one or two from this group in each model iteration. -Error Metric (WMAPE): We evaluated the influence of variables on the Weighted Mean Absolute Percentage Error (WMAPE). -ACF (Autocorrelation Function): We monitored the ACF to ensure model residuals were independent. -Adjusted R-squared: This metric assessed the model’s explanatory power. -We also identified variables with minimal impact on the prediction horizon, such as CSNOW, and excluded them from the final model
Data Splitting and Evaluation: -To calculate error metrics, the data was divided into training and testing sets. -Training data spanned from January 1st, 2022 to January 31st, 2024. -Testing data covered the period from February 1st, 2024 to May 15th, 2024.
Post-Processing: -Predicted values were adjusted to account for production constraints: Production values cannot be negative. An upper bound of 10 units was applied.
In the second case, we used ARIMA model for hourly production forecasting. We utilized the auto.arima function to identify optimal parameters for the ARIMA model.
Remark: By analyzing the performance of both models, we can determine which approach offers the most accurate and reliable forecasts for hourly production. We followed mentioned steps for 14 distinct models (for hours between 5 and 18). In order to show steps in detail, we provided a sample procedure for 9 AM. The base case outputs for time series linear regression and ARIMA models at 9 AM are provided below. Information on model modifications is detailed under the output tables.
hour9 <- subset(merged_data, hour %in% c(9))
hour9$lag2_prod <- hour9[ , .(lag2_prod = shift(hour9$production,n = 2,fill = NA))]
hour9$lag3_prod <- hour9[ , .(lag3_prod = shift(hour9$production,n = 3,fill = NA))]
test_start_index <- which(hour9$date == as.Date("2024-02-01"))
train_data <- hour9[1:(test_start_index - 1), ]
test_data9 <- hour9[test_start_index:(nrow(hour9)), ]
hour9_m1_train <- lm(production ~
lag2_prod +
lag3_prod +
month +
dswrf_surface_avg +
tmp_surface_avg +
dlwrf_surface_avg +
csnow_surface_avg +
tcdc_entire.atmosphere_avg +
tcdc_high.cloud.layer_avg +
tcdc_low.cloud.layer_avg +
tcdc_middle.cloud.layer_avg +
uswrf_top_of_atmosphere_avg +
uswrf_surface_avg +
-1, train_data)
test_data9$Predicted <- predict(hour9_m1_train, test_data9)
test_data9$Predicted[test_data9$Predicted < 0] <- 0
test_data9$Predicted[test_data9$Predicted > 10] <- 10
ggplot(test_data9 ,aes(x=date)) +
geom_line(aes(y=production,color='real')) +
geom_line(aes(y=Predicted,color='predicted'))
summary(hour9_m1_train)
Call:
lm(formula = production ~ lag2_prod + lag3_prod + month + dswrf_surface_avg +
tmp_surface_avg + dlwrf_surface_avg + csnow_surface_avg +
tcdc_entire.atmosphere_avg + tcdc_high.cloud.layer_avg +
tcdc_low.cloud.layer_avg + tcdc_middle.cloud.layer_avg +
uswrf_top_of_atmosphere_avg + uswrf_surface_avg + -1, data = train_data)
Residuals:
Min 1Q Median 3Q Max
-7.3239 -0.8367 0.3086 1.0593 5.9052
Coefficients:
Estimate Std. Error t value Pr(>|t|)
lag2_prod 0.0719360 0.0291785 2.465 0.0139 *
lag3_prod 0.0381950 0.0288615 1.323 0.1861
month1 -6.4558672 10.0841063 -0.640 0.5222
month2 -4.2337463 10.0945582 -0.419 0.6750
month3 -4.0304970 10.1629336 -0.397 0.6918
month4 -5.7159298 10.1855414 -0.561 0.5748
month5 -5.2696581 10.1531630 -0.519 0.6039
month6 -5.1177487 10.1533903 -0.504 0.6144
month7 -5.2424290 10.3528826 -0.506 0.6127
month8 -6.2625482 10.5598252 -0.593 0.5533
month9 -5.3892556 10.4854580 -0.514 0.6074
month10 -5.3017915 10.3207038 -0.514 0.6076
month11 -5.8549121 10.2060833 -0.574 0.5664
month12 -6.2204900 10.1143113 -0.615 0.5387
dswrf_surface_avg 0.0011729 0.0029433 0.399 0.6904
tmp_surface_avg 0.0564788 0.0403114 1.401 0.1616
dlwrf_surface_avg -0.0126738 0.0065456 -1.936 0.0532 .
csnow_surface_avg 0.2158874 0.5938470 0.364 0.7163
tcdc_entire.atmosphere_avg -0.0053210 0.0060129 -0.885 0.3765
tcdc_high.cloud.layer_avg -0.0014737 0.0052999 -0.278 0.7810
tcdc_low.cloud.layer_avg -0.0347388 0.0076623 -4.534 6.77e-06 ***
tcdc_middle.cloud.layer_avg -0.0187736 0.0045772 -4.102 4.56e-05 ***
uswrf_top_of_atmosphere_avg 0.0006273 0.0034463 0.182 0.8556
uswrf_surface_avg -0.0046808 0.0042759 -1.095 0.2740
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.94 on 734 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared: 0.9421, Adjusted R-squared: 0.9402
F-statistic: 497.8 on 24 and 734 DF, p-value: < 2.2e-16
checkresiduals(hour9_m1_train)
Breusch-Godfrey test for serial correlation of order up to 27
data: Residuals
LM test = 40.374, df = 27, p-value = 0.04725
accuracy_metrics <- test_data9[,accu(production, Predicted)]
accuracy_metrics
NA
hour9_m2_train <- auto.arima(hour9$production)
forecasted_values <- forecast(hour9_m2_train, h = nrow(test_data9))
test_data9$Predicted <- forecasted_values$mean
test_data9$Predicted[test_data9$Predicted < 0] <- 0
test_data9$Predicted[test_data9$Predicted > 10] <- 10
ggplot(test_data9 ,aes(x=date)) +
geom_line(aes(y=production,color='real')) +
geom_line(aes(y=Predicted,color='predicted'))
summary(hour9_m2_train)
Series: hour9$production
ARIMA(3,1,3)
Coefficients:
ar1 ar2 ar3 ma1 ma2 ma3
-1.2156 -0.3152 0.2603 0.5456 -0.6462 -0.7200
s.e. 0.8145 0.5968 0.1630 0.8334 0.0704 0.7316
sigma^2 = 6.584: log likelihood = -2040.31
AIC=4094.63 AICc=4094.76 BIC=4127.97
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set 0.03611137 2.555531 1.95991 -Inf Inf 0.9453011 -0.009330891
checkresiduals(hour9_m2_train)
Ljung-Box test
data: Residuals from ARIMA(3,1,3)
Q* = 8.0641, df = 4, p-value = 0.08926
Model df: 6. Total lags used: 10
accuracy_metrics <- test_data9[,accu(production, Predicted)]
accuracy_metrics
NA
In the case of the 9 AM model, we observed key differences among ARIMA and time series linear regression models:
Autocorrelation Function (ACF): The ARIMA model exhibited a well-behaved ACF within the confidence bounds, indicating no significant autocorrelation in the residuals. Conversely, the time series linear regression model displayed a spike at lag 1, suggesting potential remaining correlation in its residuals.
Weighted Mean Absolute Percentage Error (WMAPE): The time series linear regression model achieved a lower WMAPE (0.180) compared to the ARIMA model (0.275). This indicates a more accurate representation of the actual production values by the linear regression model for the specific time point of 9 AM.
Based on these observations, we selected the time series linear regression model for hourly production forecasting at 9 AM. However, the exploration continues to identify potentially better models that can further improve error metrics and achieve a more ideal ACF behavior.
This section details the process of refining the time series linear regression model for improved accuracy.
Variable Selection:
-TCDC Variables: We observed high significance for tcdc_low.cloud.layer_avg and tcdc_middle.cloud.layer_avg, while other TCDC variables exhibited lower significance. Considering the high correlation within TCDC variables, we aimed to eliminate two of them. Based on this goal, tcdc_entire.atmosphere_avg and tcdc_high.atmosphere_avg were removed.
-Additional Variable Elimination: Variables with low significance, including dswrf_surface_avg, uswrf_top_of_atmosphere_avg, and uswrf_surface_avg, were excluded. Their removal did not negatively impact the model’s performance as measured by ACF, WMAPE, or adjusted R-squared.
-Maintaining Variables for Further Analysis: Despite not reaching high significance levels, tmp_surface_avg, dlwrf_surface_avg, and lag 3 were retained due to their non-negligible impact. We will continue to monitor their influence during further model modifications. In contrast, lag 2 demonstrated a strong impact and was retained due to its potential to capture production trends from the previous hour, which is valuable for an hourly model.
-Dummy Variables: Preliminary analysis suggests that dummy variables might not be necessary in the current model. For now, they will be excluded, but further investigation is needed to confirm this decision.
hour9 <- subset(merged_data, hour %in% c(9))
hour9$lag2_prod <- hour9[ , .(lag2_prod = shift(hour9$production,n = 2,fill = NA))]
hour9$lag3_prod <- hour9[ , .(lag3_prod = shift(hour9$production,n = 3,fill = NA))]
test_start_index <- which(hour9$date == as.Date("2024-02-01"))
train_data <- hour9[1:(test_start_index - 1), ]
test_data9 <- hour9[test_start_index:(nrow(hour9)), ]
hour9_m1_train <- lm(production ~
lag2_prod +
lag3_prod +
#month +
#dswrf_surface_avg +
tmp_surface_avg +
dlwrf_surface_avg +
#csnow_surface_avg +
#tcdc_entire.atmosphere_avg +
#tcdc_high.cloud.layer_avg +
tcdc_low.cloud.layer_avg +
tcdc_middle.cloud.layer_avg +
#uswrf_top_of_atmosphere_avg +
#uswrf_surface_avg +
-1, train_data)
test_data9$Predicted <- predict(hour9_m1_train, test_data9)
test_data9$Predicted[test_data9$Predicted < 0] <- 0
test_data9$Predicted[test_data9$Predicted > 10] <- 10
ggplot(test_data9 ,aes(x=date)) +
geom_line(aes(y=production,color='real')) +
geom_line(aes(y=Predicted,color='predicted'))
summary(hour9_m1_train)
Call:
lm(formula = production ~ lag2_prod + lag3_prod + tmp_surface_avg +
dlwrf_surface_avg + tcdc_low.cloud.layer_avg + tcdc_middle.cloud.layer_avg +
-1, data = train_data)
Residuals:
Min 1Q Median 3Q Max
-7.1893 -0.9803 0.3613 1.0730 6.7550
Coefficients:
Estimate Std. Error t value Pr(>|t|)
lag2_prod 0.121634 0.029941 4.062 5.37e-05 ***
lag3_prod 0.070917 0.029647 2.392 0.0170 *
tmp_surface_avg 0.029171 0.002140 13.631 < 2e-16 ***
dlwrf_surface_avg -0.004945 0.002409 -2.053 0.0404 *
tcdc_low.cloud.layer_avg -0.037752 0.003213 -11.750 < 2e-16 ***
tcdc_middle.cloud.layer_avg -0.027445 0.003238 -8.476 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.044 on 752 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared: 0.9342, Adjusted R-squared: 0.9336
F-statistic: 1779 on 6 and 752 DF, p-value: < 2.2e-16
checkresiduals(hour9_m1_train)
Breusch-Godfrey test for serial correlation of order up to 10
data: Residuals
LM test = 73.065, df = 10, p-value = 1.131e-11
accuracy_metrics <- test_data9[,accu(production, Predicted)]
accuracy_metrics
NA
While our variable selection process yielded a model with similar adjusted R-squared compared to the base model, we observed an increase in WMAPE. Additionally, the ACF plot revealed persistent seasonality. This suggests that even though monthly dummy variables didn’t appear significant in the initial model, their inclusion might be necessary to address seasonality and potentially improve WMAPE. We will revisit the inclusion of these variables in the next stage of model refinement.
The results for final model can be seen below.
hour9 <- subset(merged_data, hour %in% c(9))
hour9$lag2_prod <- hour9[ , .(lag2_prod = shift(hour9$production,n = 2,fill = NA))]
hour9$lag3_prod <- hour9[ , .(lag3_prod = shift(hour9$production,n = 3,fill = NA))]
test_start_index <- which(hour9$date == as.Date("2024-02-01"))
train_data <- hour9[1:(test_start_index - 1), ]
test_data9 <- hour9[test_start_index:(nrow(hour9)), ]
hour9_m1_train <- lm(production ~
lag2_prod +
lag3_prod +
month +
#dswrf_surface_avg +
tmp_surface_avg +
dlwrf_surface_avg +
#csnow_surface_avg +
#tcdc_entire.atmosphere_avg +
#tcdc_high.cloud.layer_avg +
tcdc_low.cloud.layer_avg +
tcdc_middle.cloud.layer_avg +
#uswrf_top_of_atmosphere_avg +
#uswrf_surface_avg +
-1, train_data)
test_data9$Predicted <- predict(hour9_m1_train, test_data9)
test_data9$Predicted[test_data9$Predicted < 0] <- 0
test_data9$Predicted[test_data9$Predicted > 10] <- 10
ggplot(test_data9 ,aes(x=date)) +
geom_line(aes(y=production,color='real')) +
geom_line(aes(y=Predicted,color='predicted'))
summary(hour9_m1_train)
Call:
lm(formula = production ~ lag2_prod + lag3_prod + month + tmp_surface_avg +
dlwrf_surface_avg + tcdc_low.cloud.layer_avg + tcdc_middle.cloud.layer_avg +
-1, data = train_data)
Residuals:
Min 1Q Median 3Q Max
-7.3811 -0.8130 0.3234 1.0857 5.8721
Coefficients:
Estimate Std. Error t value Pr(>|t|)
lag2_prod 0.074773 0.029131 2.567 0.01046 *
lag3_prod 0.040686 0.028536 1.426 0.15435
month1 -14.814200 6.723536 -2.203 0.02788 *
month2 -12.941089 6.664518 -1.942 0.05254 .
month3 -12.632918 6.820409 -1.852 0.06439 .
month4 -14.199508 6.964533 -2.039 0.04182 *
month5 -13.680651 6.974396 -1.962 0.05019 .
month6 -13.559536 6.980786 -1.942 0.05247 .
month7 -13.780243 7.107636 -1.939 0.05291 .
month8 -15.006784 7.211179 -2.081 0.03777 *
month9 -14.045545 7.126771 -1.971 0.04912 *
month10 -13.894358 6.954911 -1.998 0.04611 *
month11 -14.302617 6.828810 -2.094 0.03656 *
month12 -14.587903 6.732278 -2.167 0.03056 *
tmp_surface_avg 0.086331 0.026284 3.285 0.00107 **
dlwrf_surface_avg -0.013939 0.004988 -2.795 0.00533 **
tcdc_low.cloud.layer_avg -0.033140 0.005134 -6.455 1.95e-10 ***
tcdc_middle.cloud.layer_avg -0.020595 0.004052 -5.083 4.71e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.94 on 740 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared: 0.9416, Adjusted R-squared: 0.9402
F-statistic: 663.4 on 18 and 740 DF, p-value: < 2.2e-16
checkresiduals(hour9_m1_train)
Breusch-Godfrey test for serial correlation of order up to 21
data: Residuals
LM test = 33.254, df = 21, p-value = 0.04348
accuracy_metrics <- test_data9[,accu(production, Predicted)]
accuracy_metrics
Through a series of refinements, we have identified the optimal time series linear regression model for hourly production forecasting at 9 AM.
Model Characteristics
-Improved WMAPE: The final model achieves a lower WMAPE (0.172) compared to both the base model (0.180) and the previously modified version (0.197). This signifies a more accurate representation of actual production values.
-Reduced Seasonality: The reintroduction of monthly dummy variables effectively addressed the seasonality observed in the ACF plot. This contributes to a more robust and reliable model.
-High Adjusted R-Squared (0.94): The model demonstrates a strong explanatory power, indicating it effectively captures the relationship between the predictor variables and production.
-Significant Variables: All remaining variables within the model appear statistically significant, suggesting they contribute meaningfully to the prediction process.
Hour 5
hour5 <- subset(merged_data, hour %in% c(5))
hour5$lag2_prod <- hour5[ , .(lag2_prod = shift(hour5$production,n = 2,fill = NA))]
hour5$lag3_prod <- hour5[ , .(lag3_prod = shift(hour5$production,n = 3,fill = NA))]
test_start_index <- which(hour5$date == as.Date("2024-02-01"))
train_data <- hour5[1:(test_start_index - 1), ]
test_data5 <- hour5[test_start_index:(nrow(hour5)), ]
hour5_m1_train <- lm(production ~ lag2_prod +
lag3_prod +
tmp_surface_avg +
-1, train_data)
test_data5$Predicted <- predict(hour5_m1_train, test_data5)
test_data5$Predicted[test_data5$Predicted < 0] <- 0
test_data5$Predicted[test_data5$Predicted > 10] <- 10
ggplot(test_data5 ,aes(x=date)) +
geom_line(aes(y=production,color='real')) +
geom_line(aes(y=Predicted,color='predicted'))
summary(hour5_m1_train)
Call:
lm(formula = production ~ lag2_prod + lag3_prod + tmp_surface_avg +
-1, data = train_data)
Residuals:
Min 1Q Median 3Q Max
-1.20018 -0.00914 -0.00874 -0.00846 1.80943
Coefficients:
Estimate Std. Error t value Pr(>|t|)
lag2_prod 4.405e-01 3.270e-02 13.47 <2e-16 ***
lag3_prod 4.099e-01 3.271e-02 12.53 <2e-16 ***
tmp_surface_avg 3.108e-05 1.452e-05 2.14 0.0326 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1077 on 755 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared: 0.6655, Adjusted R-squared: 0.6642
F-statistic: 500.7 on 3 and 755 DF, p-value: < 2.2e-16
checkresiduals(hour5_m1_train)
Breusch-Godfrey test for serial correlation of order up to 10
data: Residuals
LM test = 125.27, df = 10, p-value < 2.2e-16
accuracy_metrics <- test_data5[,accu(production, Predicted)]
accuracy_metrics
Hour 6
hour6 <- subset(merged_data, hour %in% c(6))
hour6$lag2_prod <- hour6[ , .(lag2_prod = shift(hour6$production,n = 2,fill = NA))]
hour6$lag3_prod <- hour6[ , .(lag3_prod = shift(hour6$production,n = 3,fill = NA))]
test_start_index <- which(hour6$date == as.Date("2024-02-01"))
train_data <- hour6[1:(test_start_index - 1), ]
test_data6 <- hour6[test_start_index:(nrow(hour6)), ]
hour6_m1_train <- lm(production ~ lag2_prod +
lag3_prod +
-1, hour6)
test_data6$Predicted <- predict(hour6_m1_train, test_data6)
test_data6$Predicted[test_data6$Predicted < 0] <- 0
test_data6$Predicted[test_data6$Predicted > 10] <- 10
ggplot(test_data6 ,aes(x=date)) +
geom_line(aes(y=production,color='real')) +
geom_line(aes(y=Predicted,color='predicted'))
summary(hour6_m1_train)
Call:
lm(formula = production ~ lag2_prod + lag3_prod + -1, data = hour6)
Residuals:
Min 1Q Median 3Q Max
-2.7942 0.0000 0.0000 0.1406 5.2175
Coefficients:
Estimate Std. Error t value Pr(>|t|)
lag2_prod 0.4795 0.0329 14.57 <2e-16 ***
lag3_prod 0.4011 0.0330 12.15 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6616 on 861 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared: 0.6945, Adjusted R-squared: 0.6938
F-statistic: 978.9 on 2 and 861 DF, p-value: < 2.2e-16
checkresiduals(hour6_m1_train)
Breusch-Godfrey test for serial correlation of order up to 10
data: Residuals
LM test = 251.06, df = 10, p-value < 2.2e-16
accuracy_metrics <- test_data6[,accu(production, Predicted)]
accuracy_metrics
Hour 7
hour7 <- subset(merged_data, hour %in% c(7))
hour7$lag2_prod <- hour7[ , .(lag2_prod = shift(hour7$production,n = 2,fill = NA))]
hour7$lag3_prod <- hour7[ , .(lag3_prod = shift(hour7$production,n = 3,fill = NA))]
test_start_index <- which(hour7$date == as.Date("2024-02-01"))
train_data <- hour7[1:(test_start_index - 1), ]
test_data7 <- hour7[test_start_index:(nrow(hour7)), ]
hour7_m1_train <- lm(production ~ lag2_prod +
lag3_prod +
dlwrf_surface_avg +
tmp_surface_avg +
month +
-1, train_data)
test_data7$Predicted <- predict(hour7_m1_train, test_data7)
test_data7$Predicted[test_data7$Predicted < 0] <- 0
test_data7$Predicted[test_data7$Predicted > 10] <- 10
ggplot(test_data7 ,aes(x=date)) +
geom_line(aes(y=production,color='real')) +
geom_line(aes(y=Predicted,color='predicted'))
summary(hour7_m1_train)
Call:
lm(formula = production ~ lag2_prod + lag3_prod + dlwrf_surface_avg +
tmp_surface_avg + month + -1, data = train_data)
Residuals:
Min 1Q Median 3Q Max
-4.9535 -0.7897 0.1702 0.8267 9.0257
Coefficients:
Estimate Std. Error t value Pr(>|t|)
lag2_prod 0.256155 0.035752 7.165 1.88e-12 ***
lag3_prod 0.094735 0.035683 2.655 0.0081 **
dlwrf_surface_avg -0.017236 0.001912 -9.017 < 2e-16 ***
tmp_surface_avg 0.085337 0.011686 7.302 7.30e-13 ***
month1 -19.386464 3.202954 -6.053 2.26e-09 ***
month2 -18.822991 3.174456 -5.930 4.66e-09 ***
month3 -18.186138 3.241408 -5.611 2.85e-08 ***
month4 -18.027165 3.379005 -5.335 1.27e-07 ***
month5 -17.664952 3.426950 -5.155 3.26e-07 ***
month6 -17.202595 3.476301 -4.949 9.26e-07 ***
month7 -17.537624 3.533063 -4.964 8.58e-07 ***
month8 -18.782987 3.618610 -5.191 2.71e-07 ***
month9 -18.154850 3.530485 -5.142 3.48e-07 ***
month10 -18.012207 3.418399 -5.269 1.80e-07 ***
month11 -18.808982 3.320971 -5.664 2.12e-08 ***
month12 -19.378535 3.259071 -5.946 4.23e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.439 on 742 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared: 0.8445, Adjusted R-squared: 0.8411
F-statistic: 251.8 on 16 and 742 DF, p-value: < 2.2e-16
checkresiduals(hour7_m1_train)
Breusch-Godfrey test for serial correlation of order up to 19
data: Residuals
LM test = 111.09, df = 19, p-value = 5.017e-15
accuracy_metrics <- test_data7[,accu(production, Predicted)]
accuracy_metrics
Hour 8
hour8 <- subset(merged_data, hour %in% c(8))
hour8$lag2_prod <- hour8[ , .(lag2_prod = shift(hour8$production,n = 2,fill = NA))]
hour8$lag3_prod <- hour8[ , .(lag3_prod = shift(hour8$production,n = 3,fill = NA))]
test_start_index <- which(hour8$date == as.Date("2024-02-01"))
train_data <- hour8[1:(test_start_index - 1), ]
test_data8 <- hour8[test_start_index:(nrow(hour8)), ]
hour8_m1_train <- lm(production ~ lag2_prod +
lag3_prod +
dswrf_surface_avg +
tcdc_low.cloud.layer_avg +
tcdc_middle.cloud.layer_avg +
tmp_surface_avg +
-1, train_data)
test_data8$Predicted <- predict(hour8_m1_train, test_data8)
test_data8$Predicted[test_data8$Predicted < 0] <- 0
test_data8$Predicted[test_data8$Predicted > 10] <- 10
ggplot(test_data8 ,aes(x=date)) +
geom_line(aes(y=production,color='real')) +
geom_line(aes(y=Predicted,color='predicted'))
summary(hour8_m1_train)
Call:
lm(formula = production ~ lag2_prod + lag3_prod + dswrf_surface_avg +
tcdc_low.cloud.layer_avg + tcdc_middle.cloud.layer_avg +
tmp_surface_avg + -1, data = train_data)
Residuals:
Min 1Q Median 3Q Max
-6.9561 -1.1102 0.2538 1.1827 5.3256
Coefficients:
Estimate Std. Error t value Pr(>|t|)
lag2_prod 0.1857247 0.0305255 6.084 1.86e-09 ***
lag3_prod 0.0701129 0.0299798 2.339 0.0196 *
dswrf_surface_avg 0.0038090 0.0004567 8.341 3.51e-16 ***
tcdc_low.cloud.layer_avg -0.0255368 0.0029795 -8.571 < 2e-16 ***
tcdc_middle.cloud.layer_avg -0.0119863 0.0030266 -3.960 8.19e-05 ***
tmp_surface_avg 0.0088008 0.0010105 8.709 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.837 on 752 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared: 0.9104, Adjusted R-squared: 0.9097
F-statistic: 1274 on 6 and 752 DF, p-value: < 2.2e-16
checkresiduals(hour8_m1_train)
Breusch-Godfrey test for serial correlation of order up to 10
data: Residuals
LM test = 94.22, df = 10, p-value = 7.769e-16
accuracy_metrics <- test_data8[,accu(production, Predicted)]
accuracy_metrics
Hour 9
hour9 <- subset(merged_data, hour %in% c(9))
hour9$lag2_prod <- hour9[ , .(lag2_prod = shift(hour9$production,n = 2,fill = NA))]
hour9$lag3_prod <- hour9[ , .(lag3_prod = shift(hour9$production,n = 3,fill = NA))]
test_start_index <- which(hour9$date == as.Date("2024-02-01"))
train_data <- hour9[1:(test_start_index - 1), ]
test_data9 <- hour9[test_start_index:(nrow(hour9)), ]
hour9_m1_train <- lm(production ~ lag2_prod +
lag3_prod +
tcdc_low.cloud.layer_avg +
tcdc_middle.cloud.layer_avg +
uswrf_top_of_atmosphere_avg +
month +
-1, train_data)
test_data9$Predicted <- predict(hour9_m1_train, test_data9)
test_data9$Predicted[test_data9$Predicted < 0] <- 0
test_data9$Predicted[test_data9$Predicted > 10] <- 10
ggplot(test_data9 ,aes(x=date)) +
geom_line(aes(y=production,color='real')) +
geom_line(aes(y=Predicted,color='predicted'))
summary(hour9_m1_train)
Call:
lm(formula = production ~ lag2_prod + lag3_prod + tcdc_low.cloud.layer_avg +
tcdc_middle.cloud.layer_avg + uswrf_top_of_atmosphere_avg +
month + -1, data = train_data)
Residuals:
Min 1Q Median 3Q Max
-7.5964 -0.8653 0.3651 1.0698 6.4259
Coefficients:
Estimate Std. Error t value Pr(>|t|)
lag2_prod 0.073272 0.028689 2.554 0.01085 *
lag3_prod 0.042503 0.028153 1.510 0.13155
tcdc_low.cloud.layer_avg -0.038403 0.004416 -8.696 < 2e-16 ***
tcdc_middle.cloud.layer_avg -0.024226 0.003436 -7.051 4.06e-12 ***
uswrf_top_of_atmosphere_avg -0.004697 0.001743 -2.695 0.00719 **
month1 7.312107 0.382843 19.099 < 2e-16 ***
month2 9.512590 0.562509 16.911 < 2e-16 ***
month3 10.291389 0.571088 18.021 < 2e-16 ***
month4 8.926082 0.529565 16.856 < 2e-16 ***
month5 9.421316 0.533969 17.644 < 2e-16 ***
month6 9.383850 0.532958 17.607 < 2e-16 ***
month7 9.576374 0.544679 17.582 < 2e-16 ***
month8 8.534700 0.516935 16.510 < 2e-16 ***
month9 9.344933 0.524668 17.811 < 2e-16 ***
month10 8.970869 0.498031 18.013 < 2e-16 ***
month11 8.082484 0.442542 18.264 < 2e-16 ***
month12 7.436882 0.400717 18.559 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.944 on 741 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared: 0.9413, Adjusted R-squared: 0.94
F-statistic: 699.4 on 17 and 741 DF, p-value: < 2.2e-16
checkresiduals(hour9_m1_train)
Breusch-Godfrey test for serial correlation of order up to 20
data: Residuals
LM test = 33.271, df = 20, p-value = 0.0315
accuracy_metrics <- test_data9[,accu(production, Predicted)]
accuracy_metrics
Hour 10
hour10 <- subset(merged_data, hour %in% c(10))
hour10$lag2_prod <- hour10[ , .(lag2_prod = shift(hour10$production,n = 2,fill = NA))]
hour10$lag3_prod <- hour10[ , .(lag3_prod = shift(hour10$production,n = 3,fill = NA))]
test_start_index <- which(hour10$date == as.Date("2024-02-01"))
train_data <- hour10[1:(test_start_index - 1), ]
test_data10 <- hour10[test_start_index:(nrow(hour10)), ]
hour10_m1_train <- lm(production ~ lag2_prod +
tcdc_low.cloud.layer_avg +
tcdc_middle.cloud.layer_avg +
tmp_surface_avg +
-1, train_data)
test_data10$Predicted <- predict(hour10_m1_train, test_data10)
test_data10$Predicted[test_data10$Predicted < 0] <- 0
test_data10$Predicted[test_data10$Predicted > 10] <- 10
ggplot(test_data10 ,aes(x=date)) +
geom_line(aes(y=production,color='real')) +
geom_line(aes(y=Predicted,color='predicted'))
summary(hour10_m1_train)
Call:
lm(formula = production ~ lag2_prod + tcdc_low.cloud.layer_avg +
tcdc_middle.cloud.layer_avg + tmp_surface_avg + -1, data = train_data)
Residuals:
Min 1Q Median 3Q Max
-9.2157 -0.6703 0.2466 1.0902 6.7354
Coefficients:
Estimate Std. Error t value Pr(>|t|)
lag2_prod 0.0911665 0.0273433 3.334 0.000897 ***
tcdc_low.cloud.layer_avg -0.0388074 0.0033031 -11.749 < 2e-16 ***
tcdc_middle.cloud.layer_avg -0.0293496 0.0033485 -8.765 < 2e-16 ***
tmp_surface_avg 0.0291293 0.0008409 34.641 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.13 on 755 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.9377, Adjusted R-squared: 0.9374
F-statistic: 2842 on 4 and 755 DF, p-value: < 2.2e-16
checkresiduals(hour10_m1_train)
Breusch-Godfrey test for serial correlation of order up to 10
data: Residuals
LM test = 52.069, df = 10, p-value = 1.108e-07
accuracy_metrics <- test_data10[,accu(production, Predicted)]
accuracy_metrics
Hour 11
hour11 <- subset(merged_data, hour %in% c(11))
hour11$lag2_prod <- hour11[ , .(lag2_prod = shift(hour11$production,n = 2,fill = NA))]
hour11$lag3_prod <- hour11[ , .(lag3_prod = shift(hour11$production,n = 3,fill = NA))]
test_start_index <- which(hour11$date == as.Date("2024-02-01"))
train_data <- hour11[1:(test_start_index - 1), ]
test_data11 <- hour11[test_start_index:(nrow(hour11)), ]
hour11_m1_train <- lm(production ~ lag2_prod +
tcdc_low.cloud.layer_avg +
tcdc_middle.cloud.layer_avg +
tmp_surface_avg +
-1, train_data)
test_data11$Predicted <- predict(hour11_m1_train, test_data11)
test_data11$Predicted[test_data11$Predicted < 0] <- 0
test_data11$Predicted[test_data11$Predicted > 10] <- 10
ggplot(test_data11 ,aes(x=date)) +
geom_line(aes(y=production,color='real')) +
geom_line(aes(y=Predicted,color='predicted'))
summary(hour11_m1_train)
Call:
lm(formula = production ~ lag2_prod + tcdc_low.cloud.layer_avg +
tcdc_middle.cloud.layer_avg + tmp_surface_avg + -1, data = train_data)
Residuals:
Min 1Q Median 3Q Max
-9.6132 -0.5252 0.2649 1.0753 6.9656
Coefficients:
Estimate Std. Error t value Pr(>|t|)
lag2_prod 0.1419115 0.0273830 5.182 2.81e-07 ***
tcdc_low.cloud.layer_avg -0.0382137 0.0032693 -11.689 < 2e-16 ***
tcdc_middle.cloud.layer_avg -0.0276353 0.0033682 -8.205 9.97e-16 ***
tmp_surface_avg 0.0282623 0.0008595 32.884 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.119 on 754 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared: 0.9397, Adjusted R-squared: 0.9394
F-statistic: 2937 on 4 and 754 DF, p-value: < 2.2e-16
checkresiduals(hour11_m1_train)
Breusch-Godfrey test for serial correlation of order up to 10
data: Residuals
LM test = 65.562, df = 10, p-value = 3.164e-10
accuracy_metrics <- test_data11[,accu(production, Predicted)]
accuracy_metrics
Hour 12
hour12 <- subset(merged_data, hour %in% c(12))
hour12$lag2_prod <- hour12[ , .(lag2_prod = shift(hour12$production,n = 2,fill = NA))]
hour12$lag3_prod <- hour12[ , .(lag3_prod = shift(hour12$production,n = 3,fill = NA))]
test_start_index <- which(hour12$date == as.Date("2024-02-01"))
train_data <- hour12[1:(test_start_index - 1), ]
test_data12 <- hour12[test_start_index:(nrow(hour12)), ]
hour12_m1_train <- lm(production ~ tcdc_low.cloud.layer_avg +
tcdc_middle.cloud.layer_avg +
tcdc_entire.atmosphere_avg +
uswrf_surface_avg +
tmp_surface_avg +
-1, train_data)
test_data12$Predicted <- predict(hour12_m1_train, test_data12)
test_data12$Predicted[test_data12$Predicted < 0] <- 0
test_data12$Predicted[test_data12$Predicted > 10] <- 10
ggplot(test_data12 ,aes(x=date)) +
geom_line(aes(y=production,color='real')) +
geom_line(aes(y=Predicted,color='predicted'))
summary(hour12_m1_train)
Call:
lm(formula = production ~ tcdc_low.cloud.layer_avg + tcdc_middle.cloud.layer_avg +
tcdc_entire.atmosphere_avg + uswrf_surface_avg + tmp_surface_avg +
-1, data = train_data)
Residuals:
Min 1Q Median 3Q Max
-9.3931 -0.6361 0.2859 1.0970 5.8946
Coefficients:
Estimate Std. Error t value Pr(>|t|)
tcdc_low.cloud.layer_avg -0.0396984 0.0031781 -12.491 < 2e-16 ***
tcdc_middle.cloud.layer_avg -0.0087877 0.0034410 -2.554 0.0108 *
tcdc_entire.atmosphere_avg -0.0080573 0.0032554 -2.475 0.0135 *
uswrf_surface_avg 0.0083429 0.0013592 6.138 1.35e-09 ***
tmp_surface_avg 0.0299954 0.0006937 43.237 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.071 on 756 degrees of freedom
Multiple R-squared: 0.9407, Adjusted R-squared: 0.9403
F-statistic: 2399 on 5 and 756 DF, p-value: < 2.2e-16
checkresiduals(hour12_m1_train)
Breusch-Godfrey test for serial correlation of order up to 10
data: Residuals
LM test = 47.837, df = 10, p-value = 6.645e-07
accuracy_metrics <- test_data12[,accu(production, Predicted)]
accuracy_metrics
Hour 13
hour13 <- subset(merged_data, hour %in% c(13))
hour13$lag2_prod <- hour13[ , .(lag2_prod = shift(hour13$production,n = 2,fill = NA))]
hour13$lag3_prod <- hour13[ , .(lag3_prod = shift(hour13$production,n = 3,fill = NA))]
test_start_index <- which(hour13$date == as.Date("2024-02-01"))
train_data <- hour13[1:(test_start_index - 1), ]
test_data13 <- hour13[test_start_index:(nrow(hour13)), ]
hour13_m1_train <- lm(production ~ tcdc_low.cloud.layer_avg +
tcdc_entire.atmosphere_avg +
uswrf_surface_avg +
tmp_surface_avg +
-1, train_data)
test_data13$Predicted <- predict(hour13_m1_train, test_data13)
test_data13$Predicted[test_data13$Predicted < 0] <- 0
test_data13$Predicted[test_data13$Predicted > 10] <- 10
ggplot(test_data13 ,aes(x=date)) +
geom_line(aes(y=production,color='real')) +
geom_line(aes(y=Predicted,color='predicted'))
summary(hour13_m1_train)
Call:
lm(formula = production ~ tcdc_low.cloud.layer_avg + tcdc_entire.atmosphere_avg +
uswrf_surface_avg + tmp_surface_avg + -1, data = train_data)
Residuals:
Min 1Q Median 3Q Max
-8.6330 -0.9171 0.2336 1.2814 6.5912
Coefficients:
Estimate Std. Error t value Pr(>|t|)
tcdc_low.cloud.layer_avg -0.039748 0.003293 -12.070 < 2e-16 ***
tcdc_entire.atmosphere_avg -0.019778 0.002769 -7.143 2.15e-12 ***
uswrf_surface_avg 0.015020 0.001613 9.311 < 2e-16 ***
tmp_surface_avg 0.028168 0.000741 38.014 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.125 on 757 degrees of freedom
Multiple R-squared: 0.9319, Adjusted R-squared: 0.9316
F-statistic: 2591 on 4 and 757 DF, p-value: < 2.2e-16
checkresiduals(hour13_m1_train)
Breusch-Godfrey test for serial correlation of order up to 10
data: Residuals
LM test = 43.124, df = 10, p-value = 4.725e-06
accuracy_metrics <- test_data13[,accu(production, Predicted)]
accuracy_metrics
Hour 14
hour14 <- subset(merged_data, hour %in% c(14))
hour14$lag2_prod <- hour14[ , .(lag2_prod = shift(hour14$production,n = 2,fill = NA))]
hour14$lag3_prod <- hour14[ , .(lag3_prod = shift(hour14$production,n = 3,fill = NA))]
test_start_index <- which(hour14$date == as.Date("2024-02-01"))
train_data <- hour14[1:(test_start_index - 1), ]
test_data14 <- hour14[test_start_index:(nrow(hour14)), ]
hour14_m1_train <- lm(production ~ lag3_prod +
dswrf_surface_avg +
tcdc_low.cloud.layer_avg +
tcdc_entire.atmosphere_avg +
uswrf_top_of_atmosphere_avg +
tmp_surface_avg +
-1, train_data)
test_data14$Predicted <- predict(hour14_m1_train, test_data14)
test_data14$Predicted[test_data14$Predicted < 0] <- 0
test_data14$Predicted[test_data14$Predicted > 10] <- 10
ggplot(test_data14 ,aes(x=date)) +
geom_line(aes(y=production,color='real')) +
geom_line(aes(y=Predicted,color='predicted'))
summary(hour14_m1_train)
Call:
lm(formula = production ~ lag3_prod + dswrf_surface_avg + tcdc_low.cloud.layer_avg +
tcdc_entire.atmosphere_avg + uswrf_top_of_atmosphere_avg +
tmp_surface_avg + -1, data = train_data)
Residuals:
Min 1Q Median 3Q Max
-8.795 -1.100 0.159 1.315 5.892
Coefficients:
Estimate Std. Error t value Pr(>|t|)
lag3_prod 0.0533049 0.0278663 1.913 0.0561 .
dswrf_surface_avg 0.0060677 0.0007015 8.650 < 2e-16 ***
tcdc_low.cloud.layer_avg -0.0221299 0.0036183 -6.116 1.54e-09 ***
tcdc_entire.atmosphere_avg -0.0229476 0.0031903 -7.193 1.53e-12 ***
uswrf_top_of_atmosphere_avg 0.0024050 0.0013609 1.767 0.0776 .
tmp_surface_avg 0.0190032 0.0010039 18.929 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.044 on 752 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared: 0.9175, Adjusted R-squared: 0.9168
F-statistic: 1394 on 6 and 752 DF, p-value: < 2.2e-16
checkresiduals(hour14_m1_train)
Breusch-Godfrey test for serial correlation of order up to 10
data: Residuals
LM test = 59.894, df = 10, p-value = 3.796e-09
accuracy_metrics <- test_data14[,accu(production, Predicted)]
accuracy_metrics
Hour 15
hour15 <- subset(merged_data, hour %in% c(15))
hour15$lag2_prod <- hour15[ , .(lag2_prod = shift(hour15$production,n = 2,fill = NA))]
hour15$lag3_prod <- hour15[ , .(lag3_prod = shift(hour15$production,n = 3,fill = NA))]
test_start_index <- which(hour15$date == as.Date("2024-02-01"))
train_data <- hour15[1:(test_start_index - 1), ]
test_data15 <- hour15[test_start_index:(nrow(hour15)), ]
hour15_m1_train <- lm(production ~ lag2_prod +
dswrf_surface_avg +
tcdc_middle.cloud.layer_avg +
tcdc_entire.atmosphere_avg +
tmp_surface_avg +
month +
-1, train_data)
test_data15$Predicted <- predict(hour15_m1_train, test_data15)
test_data15$Predicted[test_data15$Predicted < 0] <- 0
test_data15$Predicted[test_data15$Predicted > 10] <- 10
ggplot(test_data15 ,aes(x=date)) +
geom_line(aes(y=production,color='real')) +
geom_line(aes(y=Predicted,color='predicted'))
summary(hour15_m1_train)
Call:
lm(formula = production ~ lag2_prod + dswrf_surface_avg + tcdc_middle.cloud.layer_avg +
tcdc_entire.atmosphere_avg + tmp_surface_avg + month + -1,
data = train_data)
Residuals:
Min 1Q Median 3Q Max
-6.7424 -0.9160 0.0113 0.9343 6.2277
Coefficients:
Estimate Std. Error t value Pr(>|t|)
lag2_prod 0.099189 0.028821 3.442 0.000611 ***
dswrf_surface_avg 0.011833 0.001317 8.982 < 2e-16 ***
tcdc_middle.cloud.layer_avg -0.009993 0.003089 -3.235 0.001272 **
tcdc_entire.atmosphere_avg -0.013486 0.003048 -4.425 1.11e-05 ***
tmp_surface_avg 0.020272 0.015499 1.308 0.191273
month1 -3.375350 4.145215 -0.814 0.415748
month2 -2.706135 4.092772 -0.661 0.508690
month3 -2.705891 4.160422 -0.650 0.515643
month4 -3.497065 4.296411 -0.814 0.415934
month5 -3.176526 4.347950 -0.731 0.465266
month6 -3.766188 4.427296 -0.851 0.395225
month7 -4.829015 4.484968 -1.077 0.281959
month8 -3.597321 4.511107 -0.797 0.425453
month9 -3.225827 4.414673 -0.731 0.465190
month10 -3.563014 4.364395 -0.816 0.414544
month11 -3.837194 4.293210 -0.894 0.371728
month12 -4.245609 4.231262 -1.003 0.315999
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.633 on 742 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.9119, Adjusted R-squared: 0.9099
F-statistic: 451.7 on 17 and 742 DF, p-value: < 2.2e-16
checkresiduals(hour15_m1_train)
Breusch-Godfrey test for serial correlation of order up to 20
data: Residuals
LM test = 73.613, df = 20, p-value = 4.632e-08
accuracy_metrics <- test_data15[,accu(production, Predicted)]
accuracy_metrics
Hour 16
hour16 <- subset(merged_data, hour %in% c(16))
hour16$lag2_prod <- hour16[ , .(lag2_prod = shift(hour16$production,n = 2,fill = NA))]
hour16$lag3_prod <- hour16[ , .(lag3_prod = shift(hour16$production,n = 3,fill = NA))]
test_start_index <- which(hour16$date == as.Date("2024-02-01"))
train_data <- hour16[1:(test_start_index - 1), ]
test_data16 <- hour16[test_start_index:(nrow(hour16)), ]
hour16_m1_train <- lm(production ~ lag2_prod +
lag3_prod +
dswrf_surface_avg +
tcdc_entire.atmosphere_avg +
tmp_surface_avg +
-1, train_data)
test_data16$Predicted <- predict(hour16_m1_train, test_data16)
test_data16$Predicted[test_data16$Predicted < 0] <- 0
test_data16$Predicted[test_data16$Predicted > 10] <- 10
ggplot(test_data16 ,aes(x=date)) +
geom_line(aes(y=production,color='real')) +
geom_line(aes(y=Predicted,color='predicted'))
summary(hour16_m1_train)
Call:
lm(formula = production ~ lag2_prod + lag3_prod + dswrf_surface_avg +
tcdc_entire.atmosphere_avg + tmp_surface_avg + -1, data = train_data)
Residuals:
Min 1Q Median 3Q Max
-4.7887 -0.7161 -0.1633 0.4806 8.5590
Coefficients:
Estimate Std. Error t value Pr(>|t|)
lag2_prod 0.1918964 0.0350410 5.476 5.92e-08 ***
lag3_prod 0.1600983 0.0340043 4.708 2.98e-06 ***
dswrf_surface_avg 0.0104546 0.0007343 14.238 < 2e-16 ***
tcdc_entire.atmosphere_avg -0.0029772 0.0016260 -1.831 0.0675 .
tmp_surface_avg -0.0013007 0.0006076 -2.141 0.0326 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.437 on 753 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared: 0.824, Adjusted R-squared: 0.8228
F-statistic: 705.1 on 5 and 753 DF, p-value: < 2.2e-16
checkresiduals(hour16_m1_train)
Breusch-Godfrey test for serial correlation of order up to 10
data: Residuals
LM test = 110.48, df = 10, p-value < 2.2e-16
accuracy_metrics <- test_data16[,accu(production, Predicted)]
accuracy_metrics
Hour 17
Lag-7 is also added to this model as weekly seasonality was observed in the ACF of residuals.
hour17 <- subset(merged_data, hour %in% c(17))
hour17$lag2_prod <- hour17[ , .(lag2_prod = shift(hour17$production,n = 2,fill = NA))]
hour17$lag3_prod <- hour17[ , .(lag3_prod = shift(hour17$production,n = 3,fill = NA))]
hour17$lag7_prod <- hour17[ , .(lag7_prod = shift(hour17$production,n = 7,fill = NA))]
test_start_index <- which(hour17$date == as.Date("2024-02-01"))
train_data <- hour17[1:(test_start_index - 1), ]
test_data17 <- hour17[test_start_index:(nrow(hour17)), ]
hour17_m1_train <- lm(production ~ lag2_prod +
lag3_prod +
lag7_prod +
dswrf_surface_avg +
tcdc_high.cloud.layer_avg +
uswrf_surface_avg +
tmp_surface_avg +
-1, train_data)
test_data17$Predicted <- predict(hour17_m1_train, test_data17)
test_data17$Predicted[test_data17$Predicted < 0] <- 0
test_data17$Predicted[test_data17$Predicted > 10] <- 10
ggplot(test_data17 ,aes(x=date)) +
geom_line(aes(y=production,color='real')) +
geom_line(aes(y=Predicted,color='predicted'))
summary(hour17_m1_train)
Call:
lm(formula = production ~ lag2_prod + lag3_prod + lag7_prod +
dswrf_surface_avg + tcdc_high.cloud.layer_avg + uswrf_surface_avg +
tmp_surface_avg + -1, data = train_data)
Residuals:
Min 1Q Median 3Q Max
-2.9319 -0.3013 -0.0313 0.1218 5.1095
Coefficients:
Estimate Std. Error t value Pr(>|t|)
lag2_prod 0.2777948 0.0376355 7.381 4.19e-13 ***
lag3_prod 0.1479213 0.0368348 4.016 6.52e-05 ***
lag7_prod 0.3080590 0.0314149 9.806 < 2e-16 ***
dswrf_surface_avg 0.0030624 0.0006416 4.773 2.18e-06 ***
tcdc_high.cloud.layer_avg -0.0010768 0.0009430 -1.142 0.2539
uswrf_surface_avg -0.0019433 0.0021919 -0.887 0.3756
tmp_surface_avg -0.0005792 0.0002947 -1.965 0.0497 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8324 on 747 degrees of freedom
(7 observations deleted due to missingness)
Multiple R-squared: 0.7496, Adjusted R-squared: 0.7473
F-statistic: 319.5 on 7 and 747 DF, p-value: < 2.2e-16
checkresiduals(hour17_m1_train)
Breusch-Godfrey test for serial correlation of order up to 10
data: Residuals
LM test = 124.18, df = 10, p-value < 2.2e-16
accuracy_metrics <- test_data17[,accu(production, Predicted)]
accuracy_metrics
Hour 18
hour18 <- subset(merged_data, hour %in% c(18))
hour18$lag2_prod <- hour18[ , .(lag2_prod = shift(hour18$production,n = 2,fill = NA))]
hour18$lag3_prod <- hour18[ , .(lag3_prod = shift(hour18$production,n = 3,fill = NA))]
test_start_index <- which(hour18$date == as.Date("2024-02-01"))
train_data <- hour18[1:(test_start_index - 1), ]
test_data18 <- hour18[test_start_index:(nrow(hour18)), ]
hour18_m1_train <- lm(production ~ lag2_prod +
lag3_prod +
month +
-1, train_data)
test_data18$Predicted <- predict(hour18_m1_train, test_data18)
test_data18$Predicted[test_data18$Predicted < 0] <- 0
test_data18$Predicted[test_data18$Predicted > 10] <- 10
ggplot(test_data18 ,aes(x=date)) +
geom_line(aes(y=production,color='real')) +
geom_line(aes(y=Predicted,color='predicted'))
summary(hour18_m1_train)
Call:
lm(formula = production ~ lag2_prod + lag3_prod + month + -1,
data = train_data)
Residuals:
Min 1Q Median 3Q Max
-0.9288 -0.0241 0.0000 0.0000 3.8262
Coefficients:
Estimate Std. Error t value Pr(>|t|)
lag2_prod 1.453e-01 3.613e-02 4.021 6.39e-05 ***
lag3_prod 1.393e-01 3.628e-02 3.839 0.000134 ***
month1 1.586e-17 2.953e-02 0.000 1.000000
month2 0.000e+00 3.744e-02 0.000 1.000000
month3 0.000e+00 3.558e-02 0.000 1.000000
month4 0.000e+00 3.617e-02 0.000 1.000000
month5 5.072e-02 3.570e-02 1.421 0.155856
month6 1.877e-01 3.743e-02 5.014 6.66e-07 ***
month7 2.562e-01 3.946e-02 6.493 1.53e-10 ***
month8 1.761e-01 3.885e-02 4.532 6.81e-06 ***
month9 -4.407e-05 3.617e-02 -0.001 0.999028
month10 0.000e+00 3.558e-02 0.000 1.000000
month11 0.000e+00 3.617e-02 0.000 1.000000
month12 0.000e+00 3.558e-02 0.000 1.000000
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2801 on 744 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared: 0.2445, Adjusted R-squared: 0.2303
F-statistic: 17.2 on 14 and 744 DF, p-value: < 2.2e-16
checkresiduals(hour18_m1_train)
Breusch-Godfrey test for serial correlation of order up to 17
data: Residuals
LM test = 113.76, df = 17, p-value = 2.36e-16
accuracy_metrics <- test_data18[,accu(production, Predicted)]
accuracy_metrics
Below, the results and related charts are presented. Our model is designed to minimize the WMAPE score. The final WMAPE score from February 1st to May 15th is 0.2368423. The Predicted vs. Real Production graphs are also shown below. In the Predicted vs. Real Production graph, the model is considered better if the points are close to the red line. Although the points in our model are sometimes scattered, they generally appear to be around the red line. Although our model occasionally underpredicted the actual results, the best performance was achieved by building hourly models and then combining them.
test_data5$date <- as.Date(test_data5$date)
test_data6$date <- as.Date(test_data6$date)
test_data <- rbind(test_data5[,.(date,hour,production,Predicted)],
test_data6[,.(date,hour,production,Predicted)],
test_data7[,.(date,hour,production,Predicted)],
test_data8[,.(date,hour,production,Predicted)],
test_data9[,.(date,hour,production,Predicted)],
test_data10[,.(date,hour,production,Predicted)],
test_data11[,.(date,hour,production,Predicted)],
test_data12[,.(date,hour,production,Predicted)],
test_data13[,.(date,hour,production,Predicted)],
test_data14[,.(date,hour,production,Predicted)],
test_data15[,.(date,hour,production,Predicted)],
test_data16[,.(date,hour,production,Predicted)],
test_data17[,.(date,hour,production,Predicted)],
test_data18[,.(date,hour,production,Predicted)])
test_data <- test_data[order(date,hour)]
test_data
accuracy_metrics <- accu(test_data$production, test_data$Predicted)
accuracy_metrics
test_data$datetime <- as.POSIXct(paste(test_data$date, test_data$hour), format="%Y-%m-%d %H")
ggplot(test_data[1:96], aes(x = datetime)) +
geom_line(aes(y=production,color='real')) +
geom_line(aes(y=Predicted,color='predicted')) +
labs(title = "Predicted vs Real Production over Time", x = "Datetime", y = "Production", color = "Legend")
ggplot(test_data, aes(x = Predicted, y = production)) +
geom_point() +
geom_abline(color = "red") +
labs(title = "Predicted vs Real Production", x = "Predicted", y = "Real")
NA
NA
Our final model gave us a WMAPE of 0.2368423 which can be interpreted as a decent result. However for some of the hours, we could not conclude that the residuals are not autocorrelated because when Breusch-Godfrey test was implemented, some of the lags were above the significance level. Also, especially for the very early and very late hours (hours 5,6 and 17,18) the WMAPE score was significantly higher. Moreover, the data fluctuated absurdly for some days for a given hour, therefore this caused our model to under predict most of the time. Therefore the model can be improved by trying other strategies and modelling techniques as well. Below are some other future work that may improve the model:
Other dummy variables or nonlinear relations could be added in order to avoid the under prediction of the results.
We didn’t try SARIMA models when trying alternatives for the hourly modelling strategy. If this approach is also utilized, it may give better results for some of the hours.
Other methods which are not instructed in the course may be applied such as GBMs or Prophet.
Link for the code: https://github.com/BU-IE-360/spring24-eylulgulluu/blob/main/IE360_Project_Group11_Code.Rmd